[Bugfix] Fix reasoning end token missed by should_advance under async scheduling + spec decode#43526
Conversation
There was a problem hiding this comment.
Code Review
This pull request improves the reliability of reasoning end token detection in structured output by passing new_token_ids directly to the should_advance method. This update avoids potential issues with placeholder arithmetic during asynchronous scheduling and speculative decoding. The changes include new unit tests covering various scenarios and an additional test case for Qwen3 models. I have no further feedback to provide as there were no review comments.
9683efb to
14c6e4d
Compare
… scheduling + spec decode Fixes vllm-project#43388, vllm-project#34650 When async scheduling and speculative decoding are both enabled, should_advance() reconstructs the new tokens delta via placeholder arithmetic (num_computed_tokens - num_output_placeholders). After spec decode rejection adjustments and token appending, this arithmetic can produce start == len(all_token_ids), yielding an empty delta that misses the reasoning end token (e.g. </think>). As a result, reasoning_ended never flips to True and JSON grammar constraints are never applied. Fix: add an optional new_token_ids parameter to should_advance(). When the caller has the actual new tokens (the token-output path in update_from_output), pass them directly so the method checks them instead of re-deriving the delta from counter arithmetic. The fallback path (draft-token call sites) is unchanged. Signed-off-by: Raghavan <oneraghavan@gmail.com>
14c6e4d to
b5bb916
Compare
|
new_token_ids=[248069, 271, 71093] After changing the delta window to include the full new_token_ids batch, vLLM can correctly set reasoning_ended=True. However, any tokens generated after in that same batch were still sampled before the JSON grammar constraint became active. This means the fix can make apply=True from the following step onward, but it cannot retroactively constrain, reject, or remove the already-generated post-thinking tokens. If those tokens include orjson, the final content may still contain Markdown fences. There is also a related grammar state synchronization issue. If the first JSON token after has already been generated in the same speculative batch but was not accepted by the grammar FSM, the next constrained step may still expect the initial {, causing duplicated opening braces such as: {{ how about this problem? |
|
Closing this PR in favor of #43424, which takes a more comprehensive approach. Our fix here addresses the detection side (passing #43424 solves this with a pre-commit filter (
Credit to @sfbemerk for the original analysis in #36138 that charted the path for all of these fixes. |
Purpose
Fixes #43388, #34650
When async scheduling and speculative decoding are both enabled,
should_advance()misses the reasoning end token (e.g.</think>) because placeholder arithmetic produces an empty delta window.Root Cause
should_advance()computes a delta slice of recently-generated tokens via:Under async scheduling + spec decode:
num_computed_tokensis pre-incremented by_update_after_schedule(before GPU runs)update_from_outputnum_output_placeholdersfollows a parallel increment/decrement cycledelta_from == len(all_token_ids), yielding an empty deltaThe reasoning end token sits in
all_token_idsbut the delta window walks past it.reasoning_endednever becomesTrue, and JSON grammar constraints are never applied.Observed in production (from issue #43388):
Fix
Add an optional
new_token_idsparameter toshould_advance(). At the token-output call site inupdate_from_output(the only path wherenew_token_idsis naturally available), pass the actual tokens directly. This bypasses the fragile placeholder arithmetic entirely.The fallback path (draft-token call sites in
update_draft_token_ids/update_draft_token_ids_in_output) is unchanged — those rely onreasoning_endedalready being set by the priorupdate_from_outputcall.Changes
vllm/v1/structured_output/__init__.pynew_token_idsparam toshould_advance(); use it as delta when providedvllm/v1/core/sched/scheduler.pynew_token_idsat theupdate_from_outputcall sitetests/v1/structured_output/test_reasoning_structured_output.pytests/entrypoints/llm/test_struct_output_generate.pyTest Plan
Unit tests (4 new, all pass locally):
test_should_advance_with_new_token_ids_detects_reasoning_end— end token found vianew_token_idsregardless of placeholder statetest_should_advance_async_spec_decode_empty_delta_misses_end_token— reproduces the exact bug (empty delta), then showsnew_token_idsfixes ittest_should_advance_new_token_ids_structural_tag_spec_decode— structural tag + spec decode same-stepTrue(no regression)test_should_advance_new_token_ids_no_end_token— negative case: no end token →reasoning_endedstaysFalseE2E test (new matrix entry):
Qwen/Qwen3-1.7B+qwen3reasoning parser + ngram spec decode +async_scheduling=True— the exact combination that triggers the bugExisting tests (all 13 pass):
Related Work
needs-rebase). That PR addsidentify_constrained_draft_tokens()/_find_reasoning_end_in_tokens()across 3 files (+403/-95). This PR achieves the same detection fix with +25/-11 lines in 2 files.delta_from = num_computed_tokens - num_output_placeholders. That formula is correct for async-without-spec-decode but breaks when spec decode rejection adjustments shift the counters.